Day 3: Dynamic analysis of ultrasound tongue imaging data
Preliminaries
Loading packages
# data handling and visualisation
library(tidyverse)
library(lubridate)
library(ggpubr)
# FPCA + LMEs
library(fdapace)
library(lme4)
library(lmerTest)
library(emmeans)
# GAMMs
library(mgcv)
library(itsadug)
library(tidygam) # for 'get_gam_predictions'
source("https://raw.githubusercontent.com/soskuthy/gamm_intro/master/gamm_hacks.r")Loading data
The tongue spline data were exported from the AAA software and stored in the spline folder. The following code loops through the spline folder, reads each .txt file and appends it to the previous ones.
## loading data
# define the path
dir <- getwd()
# index the spline .txt files in the directory
file_list <- list.files(paste0(dir, "/ultrasound/spline"), pattern = "\\.txt$", full.names = FALSE)
# create an empty list to store data
data_list <- list()
for(i in seq_along(file_list)){
current_data <- read.delim(paste0(dir, "/ultrasound/spline/", file_list[i]), header = FALSE)
data_list[[i]] <- current_data
}
# bind all data from the list into a data frame
dat <- bind_rows(data_list)
# omit na
dat <- na.omit(dat)
Renaming columns
Let’s rename the columns. It is assumed that you have exported tongue shapes at 11 equidistant time points throughout the interval. The code here also assumes that you have exported the following variables:
- Client name (Surname, First name, full name - whichever you prefer)
- Date and time of recording (not time within recording)
- Prompt
- Label (usually made on Praat TextGrid and then imported to AAA)
- X, Y values of spline DLC_Tongue (not Tongue - this is for XY values derived from older tongue surface tracking methods)
The DeepLabCut plug-in on AAA tracks midsagittal tongue splines based on 11 points along the tongue surface. This means that we will obtain 22 values - X/Y for point 1, X/Y for point 2, and so on. The code below specifies this with Point 1 at the vallecula (i.e., a little ‘valley’ formed by the roots of the tongue and the epiglottis) and moving forward as the number increases until Point No. 11 = tongue tip. For further information about the tracking points, please refer to the following paper:
Wrench, A., & Balch-Tomes, J. (2022). Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut. Sensors, 22(3), Article 3. https://doi.org/10.3390/s22031133
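For quick reference when interpreting plots later, the point-to-anatomy mapping described above can be kept as a named vector. This is just a small convenience helper (not part of the original script), with the labels taken from the renaming comments below:

```r
# lookup from DLC point number to anatomical label (per the scheme above)
dlc_labels <- c("vallecula", "tongue root 1", "tongue root 2",
                "tongue body 1", "tongue body 2",
                "tongue dorsum 1", "tongue dorsum 2",
                "tongue blade 1", "tongue blade 2",
                "tongue tip 1", "tongue tip 2")
names(dlc_labels) <- 1:11   # names are coerced to "1".."11"
dlc_labels[["1"]]   # "vallecula"
dlc_labels[["11"]]  # "tongue tip 2"
```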
dat <- dat |>
janitor::remove_empty(which = "cols") |> # remove empty columns
dplyr::rename(
speaker = V1, # speaker ID
rec_date = V2, # date/time associated with each recording
prompt = V3, # alias of prompt and repetition
interval = V4 # textgrid label
) |>
dplyr::rename(
X_1 = V5, # vallecula
Y_1 = V6,
X_2 = V7, # tongue root 1
Y_2 = V8,
X_3 = V9, # tongue root 2
Y_3 = V10,
X_4 = V11, # tongue body 1
Y_4 = V12,
X_5 = V13, # tongue body 2
Y_5 = V14,
X_6 = V15, # tongue dorsum 1
Y_6 = V16,
X_7 = V17, # tongue dorsum 2
Y_7 = V18,
X_8 = V19, # tongue blade 1
Y_8 = V20,
X_9 = V21, # tongue blade 2
Y_9 = V22,
X_10 = V23, # tongue tip 1
Y_10 = V24,
X_11 = V25, # tongue tip 2
Y_11 = V26
)
# eliminate one token that was not labelled properly
dat <- dat |>
filter(!str_detect(rec_date, "2022/11/29 17:48:39"))
# Specifying which time point of 11 each tongue spline is associated with.
dat <- dat |>
group_by(rec_date, speaker) |>
mutate(prop_time = row_number() - 1) |> # so that first row = 0
ungroup() |>
mutate(prop_time = prop_time * 10) |> # to give percentages
mutate(prop_time = factor(prop_time)) |>
mutate(perc_time = paste(prop_time, "%", sep = "") |> factor())
# Rename the prompt '18' into 'biribiri' as AAA wasn't able to handle the Japanese characters.
dat$prompt[dat$prompt == "18"] <- "biribiri"
# check data
dat
## # A tibble: 2,981 × 28
## speaker rec_date prompt interval X_1 Y_1 X_2 Y_2 X_3 Y_3 X_4
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4ps8zx 2022/11/… berea… iri -18.9 -28.6 -15.7 -16.8 -8.28 -7.53 -0.995
## 2 4ps8zx 2022/11/… berea… iri -19.2 -30.5 -17.8 -17.9 -10.7 -8.23 -2.42
## 3 4ps8zx 2022/11/… berea… iri -20.3 -31.1 -20.0 -18.5 -13.5 -8.29 -4.29
## 4 4ps8zx 2022/11/… berea… iri -22.6 -30.3 -23.9 -17.9 -16.5 -8.34 -4.71
## 5 4ps8zx 2022/11/… berea… iri -22.9 -30.4 -23.7 -17.8 -15.2 -8.95 -2.92
## 6 4ps8zx 2022/11/… berea… iri -20.1 -31.0 -19.2 -18.5 -10.4 -8.41 -0.117
## 7 4ps8zx 2022/11/… berea… iri -16.9 -31.8 -15.5 -19.2 -7.16 -8.83 1.14
## 8 4ps8zx 2022/11/… berea… iri -15.8 -31.8 -12.6 -19.6 -4.42 -8.62 2.13
## 9 4ps8zx 2022/11/… berea… iri -16.1 -30.3 -11.3 -19.3 -4.29 -8.25 1.06
## 10 4ps8zx 2022/11/… berea… iri -15.4 -30.9 -10.9 -19.7 -4.39 -8.51 1.53
## # ℹ 2,971 more rows
## # ℹ 17 more variables: Y_4 <dbl>, X_5 <dbl>, Y_5 <dbl>, X_6 <dbl>, Y_6 <dbl>,
## # X_7 <dbl>, Y_7 <dbl>, X_8 <dbl>, Y_8 <dbl>, X_9 <dbl>, Y_9 <dbl>,
## # X_10 <dbl>, Y_10 <dbl>, X_11 <dbl>, Y_11 <dbl>, prop_time <fct>,
## # perc_time <fct>
Converting into long format
Long format is easier to handle with tidyverse, so let’s convert the data set here.
## pivot longer: Use this when you work on the tongue only
dat_long <- dat |>
pivot_longer(5:26, names_to = c(".value", "point_number"), names_sep = "_")
# check
dat_long
## # A tibble: 32,791 × 9
## speaker rec_date prompt interval prop_time perc_time point_number X
## <chr> <chr> <chr> <chr> <fct> <fct> <chr> <dbl>
## 1 4ps8zx 2022/11/29 … berea… iri 0 0% 1 -18.9
## 2 4ps8zx 2022/11/29 … berea… iri 0 0% 2 -15.7
## 3 4ps8zx 2022/11/29 … berea… iri 0 0% 3 -8.28
## 4 4ps8zx 2022/11/29 … berea… iri 0 0% 4 -0.995
## 5 4ps8zx 2022/11/29 … berea… iri 0 0% 5 7.22
## 6 4ps8zx 2022/11/29 … berea… iri 0 0% 6 18.8
## 7 4ps8zx 2022/11/29 … berea… iri 0 0% 7 31.2
## 8 4ps8zx 2022/11/29 … berea… iri 0 0% 8 36.2
## 9 4ps8zx 2022/11/29 … berea… iri 0 0% 9 40.6
## 10 4ps8zx 2022/11/29 … berea… iri 0 0% 10 44.4
## # ℹ 32,781 more rows
## # ℹ 1 more variable: Y <dbl>
Adding repetition
It is often the case that we have participants say each prompt multiple times. Personally, I find it easier to manage the data using repetition when I need to omit erroneous tokens - I usually note down when a participant makes an error during recording and filter those tokens out later.
Although AAA does not export information on repetition (as far as I know), we can easily add repetition based on the date/time of recording.
## 1. Getting R to recognise dates and time with the lubridate package
dat_long$rec_date <- lubridate::parse_date_time(dat_long$rec_date, c("Ymd HMS","dmY HM"))
# multiple formats are specified here because some rec_dates look weird when exported from AAA
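To see why two formats are listed, here is a base-R sketch of what `parse_date_time()` is doing: try each candidate format in turn and keep the first successful parse. The helper name and the example strings are hypothetical, purely for illustration:

```r
# try each format in turn; strptime() returns NA where a string doesn't match
parse_multi <- function(x, formats = c("%Y/%m/%d %H:%M:%S", "%d/%m/%Y %H:%M")) {
  res <- lapply(formats, function(f) as.POSIXct(strptime(x, format = f, tz = "UTC")))
  out <- res[[1]]
  for (r in res[-1]) {
    miss <- is.na(out)        # fill only the still-unparsed entries
    out[miss] <- r[miss]
  }
  out
}
# each string matches a different format in the candidate list
parse_multi(c("2022/11/29 17:48:39", "29/11/2022 17:48"))
```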
## 2. create a summary of the data using summarise(), add repetition information (as it's easier) and store it in the object 'rep'
rep <- dat_long |>
dplyr::group_by(speaker, prompt, rec_date) |>
dplyr::summarise() |>
dplyr::mutate(
repetition = row_number()
) |>
ungroup()
## 3. merge 'rep' with the main data ('dat_long' here) using 'left_join()'
dat_long <- dplyr::left_join(dat_long, rep, by = c("speaker", "prompt", "rec_date"))
Repetition is indeed useful here: all recordings from the first attempt by the speaker ‘2d57ke’ need to be eliminated, because I refitted the probe and ultrasound headset partway through to improve the quality of the tongue imaging. Whenever you move the probe, you also need to redo the bite plane measurement, since moving it changes the probe angle. So, let’s remove the recordings from speaker 2d57ke’s first attempt.
Creating tongue spline ID
In addition, although rec_date can act as a unique identifier for each token, I don’t find it particularly efficient to keep using it. I usually create something called a token ID by combining speaker, prompt and repetition.
When plotting tongue splines, you will also need some grouping, which can be quite tedious to set up. I find it useful to define a unique ID for each tongue spline (i.e., a set of 22 data points - 11 points each for X and Y). Later on, you can group the data by this spline ID to prevent ggplot from misreading the data structure. I create it by combining token_id and prop_time.
dat_long <- dat_long |>
dplyr::mutate(
token_id = paste(speaker, prompt, sep = "_"),
token_id = paste(token_id, repetition, sep = "_")
)
### a combination of "token_id" and "prop_time" gives a unique label for each spline
dat_long <- dat_long |>
dplyr::mutate(
spline_id = str_c(token_id, prop_time, sep = "_")
)
Data visualisation: tongue shape
The following code produces Figure 3 in my ICPhS paper. Note the grouping variable in the geom_path() function.
# Converting the speaker ID into more straightforward naming convention
dat_long_plot <- dat_long |>
dplyr::filter(speaker %in% c("jcy8xi", "3bcpyh", "kjn9n4"))
dat_long_plot$speaker[dat_long_plot$speaker == "jcy8xi"] <- "English A"
dat_long_plot$speaker[dat_long_plot$speaker == "3bcpyh"] <- "Japanese A"
dat_long_plot$speaker[dat_long_plot$speaker == "kjn9n4"] <- "Japanese B"
# Plotting the tongue splines
tongue_plot <- dat_long_plot |>
na.omit() |>
dplyr::group_by(speaker) |>
ggplot2::ggplot() +
# group by spline_id so that R recognises that each tongue shape consists of 11 DLC knots
geom_path(aes(x = X, y = Y, group = spline_id, colour = as.numeric(prop_time)), linewidth = 1, show.legend = TRUE) +
geom_hline(yintercept = 0, linetype = 3) +
theme_classic() +
scale_colour_distiller(palette = "RdYlBu", guide = guide_colourbar(name = "Proportional time (%)", title.position = "top", title.hjust = 0.5)) +
facet_grid(factor(speaker, levels = c("English A", "Japanese A", "Japanese B")) ~ prompt) +
labs(x = "X", y = "Y", color='Proportional time (%)') +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = 'bold'),
axis.text = element_text(size = 12),
axis.title = element_text(size = 15),
strip.text.x = element_text(size = 20),
strip.text.y = element_text(size = 20),
legend.text = element_text(size = 15),
legend.position = "bottom",
legend.title = element_text(size = 15, hjust = 0.5),
legend.key.width = unit(2, "cm")
)
# Showing the plot
tongue_plot
Principal Component Analysis (PCA) on midsagittal tongue shape
While the visual inspection makes it fairly obvious that there are some articulatory differences between the English and Japanese speakers, a challenging aspect of ultrasound data analysis is quantifying such differences.
One way of quantification is through principal component analysis (PCA). PCA here identifies the key modes of variation in the 11 DLC knots, and by projecting PCs back onto the original scale, we can infer the salient articulatory dimensions observed in the data.
This approach is explained in the following book:
Johnson, K. (2008). Quantitative Methods In Linguistics. Chichester, England: Wiley-Blackwell.
See particularly: Chapter 3 Phonetics (Section 3.3 Tongue Shape Factors: Principal Component Analysis, pp. 95-102).
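As a minimal sketch of the idea (simulated 2-D data, not the tongue set): when two measurements co-vary, PCA concentrates the shared variation into the first component, which is exactly how it will compress the 22 coordinate columns below.

```r
# simulate two strongly correlated measurements
set.seed(1)
shared <- rnorm(200)
toy <- cbind(a = shared + rnorm(200, sd = 0.2),
             b = shared + rnorm(200, sd = 0.2))
toy_pca <- prcomp(toy, scale = FALSE)
# proportion of variance explained by each PC
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(prop_var, 2)  # PC1 dominates; PC2 is mostly noise
```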
Running PCA using prcomp()
Let’s run PCA on the x/y coordinates of the 11 DLC knots. I use the prcomp() function here, to be consistent with our approach in the workshop, as opposed to the princomp() function originally used in my ICPhS paper.
The PCA analysis below suggests that PC1 captures 58.03% of the variance observed in the data, followed by PC2 (22.65%) and PC3 (7.66%), indicating that prcomp() and princomp() produce very similar results.
# perform PCA
pca_results <- prcomp(dplyr::select(dat, 5:26), scale = FALSE)
# summary
summary(pca_results)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 21.6260 13.5115 7.85552 5.8729 5.24669 4.29483 2.19316
## Proportion of Variance 0.5803 0.2265 0.07657 0.0428 0.03416 0.02289 0.00597
## Cumulative Proportion 0.5803 0.8068 0.88342 0.9262 0.96038 0.98327 0.98924
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.56582 1.38654 1.11780 1.02112 0.71898 0.66939 0.53253
## Proportion of Variance 0.00304 0.00239 0.00155 0.00129 0.00064 0.00056 0.00035
## Cumulative Proportion 0.99228 0.99466 0.99622 0.99751 0.99815 0.99871 0.99906
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.46951 0.39128 0.38082 0.30625 0.26579 0.18377 0.15973
## Proportion of Variance 0.00027 0.00019 0.00018 0.00012 0.00009 0.00004 0.00003
## Cumulative Proportion 0.99933 0.99952 0.99970 0.99982 0.99991 0.99995 0.99998
## PC22
## Standard deviation 0.12867
## Proportion of Variance 0.00002
## Cumulative Proportion 1.00000
Understanding data
Scree plot
Here, let’s inspect the data. As a rule of thumb, we can use a 5% threshold on the proportion of variance, as recommended by Baayen (2008). The scree plot simply visualises the proportion of variance explained by each PC, which we just saw using summary().
# Plotting the variance explained
var_explained <- pca_results$sdev^2 / sum(pca_results$sdev^2)
# making var_explained a tibble and adding the PC number
var_explained <- tidyr::as_tibble(var_explained) |>
dplyr::mutate(
PC = row_number()
)
# only plot PC10 or below
var_explained <- var_explained |>
dplyr::filter(PC < 11)
# create scree plot
scree <- var_explained |>
ggplot(mapping = aes(x = PC, y = value)) +
geom_line() +
geom_point() +
geom_text(data = subset(var_explained, PC < 4), aes(label = round(value, digits = 5)), nudge_x = 0.8) +
geom_label(data = subset(var_explained, PC < 4), aes(label = PC), label.padding = unit(0.40, "lines")) +
geom_hline(yintercept = 0.05, linetype = 'dotted') +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Proportion of Variance explained by each PC") +
ylim(0, 0.6)
# show plot
scree
Projecting PCs back onto midsagittal tongue shape
OK, now the question is what articulatory movement is captured by each PC. As explained in Johnson (2008, p. 100), it is possible to project the PC dimensions back onto the original scale. In our case, we can reconstruct the i-th data point along the tongue surface (DLC knots 1-11) and, by extension, visualise the tongue shape at a particular time point as decomposed by each PC, using the following formula:
\[ x_{ji} = \bar{x}_i \pm s_j w_{ji} \]
where \(\bar{x}_i\) denotes the mean for the i-th data point, \(s_j\) the standard deviation (SD) of the j-th PC, and \(w_{ji}\) the PC loading for the \(PC_j\) dimension.
This means that we’ll need to extract (1) mean PC scores, (2) standard deviation and (3) PC loadings from the PCA results.
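Before wrangling the real output, here is the formula in miniature on simulated data (not the tongue set): prcomp() gives us \(\bar{x}\) (center), \(s_j\) (sdev) and \(w_{ji}\) (rotation), and the ‘+’ and ‘−’ shapes sit symmetrically around the mean.

```r
set.seed(1)
toy <- matrix(rnorm(100), ncol = 4)   # 25 toy observations, 4 variables
fit <- prcomp(toy, scale = FALSE)
# x_ji = mean_i +/- sd_j * loading_ji, here for j = 1 (PC1)
pc1_plus  <- fit$center + fit$sdev[1] * fit$rotation[, 1]
pc1_minus <- fit$center - fit$sdev[1] * fit$rotation[, 1]
# the two reconstructed shapes are mirror images around the mean
all.equal(pc1_plus - fit$center, fit$center - pc1_minus)
```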
Extracting mean and loadings
The mean values are stored in pca_results$center, and
the loadings in pca_results$rotation. In order to better
understand the structure of a data object, it is useful to inspect it
using str().
## List of 5
## $ sdev : num [1:22] 21.63 13.51 7.86 5.87 5.25 ...
## $ rotation: num [1:22, 1:22] 0.2322 0.0762 0.3381 0.0492 0.35 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:22] "X_1" "Y_1" "X_2" "Y_2" ...
## .. ..$ : chr [1:22] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:22] -13.9 -30.81 -11.54 -18.43 -8.37 ...
## ..- attr(*, "names")= chr [1:22] "X_1" "Y_1" "X_2" "Y_2" ...
## $ scale : logi FALSE
## $ x : num [1:2981, 1:22] 5.59 1.78 -4.71 -5.83 -1.87 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:22] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
# Mean values from the output of the PCA
pca_mean <- tibble::enframe(pca_results$center)
# subset data to make into a matrix of x and y values
pca_mean <- pca_mean |>
dplyr::mutate(
axis_mean = str_sub(name, start = 1, end = 1),
number_mean = str_sub(name, start = 3, end = -1)
) |>
dplyr::select(-name) |>
dplyr::relocate(axis_mean, number_mean)
# extract X coordinate values of the mean values
mean_X <- pca_mean |>
dplyr::filter(axis_mean == "X") |>
dplyr::rename(mean_X = value,
axis_mean_X = axis_mean,
number_mean_X = number_mean)
# extract Y coordinate values of the mean values
mean_Y <- pca_mean |>
dplyr::filter(axis_mean == "Y") |>
dplyr::rename(mean_Y = value,
axis_mean_Y = axis_mean,
number_mean_Y = number_mean)
# combine X and Y values together
mean_pca <- cbind(mean_X, mean_Y)
## get loadings - eigenvectors
loadings <- as.table(pca_results$rotation)
Loadings
Let’s extract only the PC loadings that are relevant here. We’ll focus on PC1 and PC2, as they jointly account for the majority of the variance observed in the data (80.68%) and should be adequate for understanding the overall articulatory patterns.
PC1
Here is the code to gather necessary information for PC1.
# get loadings for PC1 in a sensible format
PC1_loadings <- as.data.frame(loadings) |>
dplyr::filter(Var2 == "PC1")
# separate axis and number for each DLC knot
PC1_loadings <- PC1_loadings |>
dplyr::mutate(
axis_loadings = str_sub(Var1, start = 1, end = 1),
number_loadings = str_sub(Var1, start = 3, end = -1)
)
## separating X coordinates
PC1_loadings_X <- PC1_loadings |>
dplyr::filter(axis_loadings == "X") |>
dplyr::select(-c(Var1, Var2)) |>
dplyr::relocate(axis_loadings, number_loadings) |>
# rename variables -- we'll combine everything later but R doesn't allow multiple columns to have identical names
dplyr::rename(PC1_loadings_X = Freq,
axis_loadings_X = axis_loadings,
number_loadings_X = number_loadings)
## separating Y coordinates
PC1_loadings_Y <- PC1_loadings |>
dplyr::filter(axis_loadings == "Y") |>
dplyr::select(-c(Var1, Var2)) |>
dplyr::relocate(axis_loadings, number_loadings) |>
# rename variables -- we'll combine everything later but R doesn't allow multiple columns to have identical names
dplyr::rename(PC1_loadings_Y = Freq,
axis_loadings_Y = axis_loadings,
number_loadings_Y = number_loadings)
## bind the X and Y data frames together
PC1_loadings <- cbind(PC1_loadings_X, PC1_loadings_Y)
PC2
The same for PC2.
## get loadings for PC2 in a sensible format
PC2_loadings <- as.data.frame(loadings) |>
dplyr::filter(Var2 == "PC2")
# separate axis and number for each DLC knot
PC2_loadings <- PC2_loadings |>
dplyr::mutate(
axis_loadings = str_sub(Var1, start = 1, end = 1),
number_loadings = str_sub(Var1, start = 3, end = -1)
)
## separating X coordinates
PC2_loadings_X <- PC2_loadings |>
dplyr::filter(axis_loadings == "X") |>
dplyr::select(-c(Var1, Var2)) |>
dplyr::relocate(axis_loadings, number_loadings) |>
# rename variables -- we'll combine everything later but R doesn't allow multiple columns to have identical names
dplyr::rename(PC2_loadings_X = Freq,
axis_loadings_X = axis_loadings,
number_loadings_X = number_loadings)
## separating Y coordinates
PC2_loadings_Y <- PC2_loadings |>
dplyr::filter(axis_loadings == "Y") |>
dplyr::select(-c(Var1, Var2)) |>
dplyr::relocate(axis_loadings, number_loadings) |>
# rename variables -- we'll combine everything later but R doesn't allow multiple columns to have identical names
dplyr::rename(PC2_loadings_Y = Freq,
axis_loadings_Y = axis_loadings,
number_loadings_Y = number_loadings)
## bind the X and Y data frames together
PC2_loadings <- cbind(PC2_loadings_X, PC2_loadings_Y)
Standard deviation
We also need the standard deviation associated with each PC. This is stored in pca_results$sdev; let’s pull out the values for PC1 and PC2, which are used below.
# extract the SDs for PC1 and PC2 from the PCA output
sd_PC1 <- pca_results$sdev[1]
sd_PC2 <- pca_results$sdev[2]
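As an aside, the loadings wrangling for PC1 and PC2 above is identical except for the PC name, so it could be wrapped in a small helper. This is just a sketch reusing the loadings table created earlier; the helper name and its output column names are my own, not the ones used above:

```r
# hypothetical helper: pull X/Y loadings for one PC from the loadings table
get_pc_loadings <- function(loadings, pc) {
  df <- as.data.frame(loadings)                # columns: Var1, Var2, Freq
  df <- df[df$Var2 == pc, ]
  axis <- substr(as.character(df$Var1), 1, 1)  # "X" or "Y"
  knot <- substring(as.character(df$Var1), 3)  # DLC knot number
  # X and Y rows come in matching knot order (X_1, Y_1, X_2, ...)
  data.frame(knot = knot[axis == "X"],
             X = df$Freq[axis == "X"],
             Y = df$Freq[axis == "Y"])
}
# usage sketch: PC1_loadings_tidy <- get_pc_loadings(loadings, "PC1")
```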
Calculating max and min values
The data visualisation will show each PC dimension by adding and subtracting the variation (i.e., \(SD \times loadings\)) to/from the mean tongue shape, as per Johnson’s (2008) formula. The following code calculates these values for the X and Y coordinates separately.
# calculate estimated values including sd
## PC1
estimate_PC1 <- cbind(mean_pca, PC1_loadings)
estimate_PC1$PC1_max_X <- estimate_PC1$mean_X + sd_PC1*estimate_PC1$PC1_loadings_X
estimate_PC1$PC1_min_X <- estimate_PC1$mean_X - sd_PC1*estimate_PC1$PC1_loadings_X
estimate_PC1$PC1_max_Y <- estimate_PC1$mean_Y + sd_PC1*estimate_PC1$PC1_loadings_Y
estimate_PC1$PC1_min_Y <- estimate_PC1$mean_Y - sd_PC1*estimate_PC1$PC1_loadings_Y
# calculate estimated values including sd
## PC2
estimate_PC2 <- cbind(mean_pca, PC2_loadings)
estimate_PC2$PC2_max_X <- estimate_PC2$mean_X + sd_PC2*estimate_PC2$PC2_loadings_X
estimate_PC2$PC2_min_X <- estimate_PC2$mean_X - sd_PC2*estimate_PC2$PC2_loadings_X
estimate_PC2$PC2_max_Y <- estimate_PC2$mean_Y + sd_PC2*estimate_PC2$PC2_loadings_Y
estimate_PC2$PC2_min_Y <- estimate_PC2$mean_Y - sd_PC2*estimate_PC2$PC2_loadings_Y
Data visualisation
Now we are all set! Let’s visualise the tongue shape to inspect what articulatory dimensions are captured by each PC.
# PC1
PC1_reconstructed <- ggplot() +
geom_line(data = estimate_PC1, aes(x = mean_X, y = mean_Y), linewidth = 1.5) +
geom_line(data = estimate_PC1, aes(x = PC1_max_X, y = PC1_max_Y), linetype = "dashed", linewidth = 1, alpha = 0.5, colour = "#E69F00") +
geom_line(data = estimate_PC1, aes(x = PC1_min_X, y = PC1_min_Y), linetype = "dotted", linewidth = 1, alpha = 0.5, colour = "#56B4E9") +
geom_point(data = estimate_PC1, aes(x = PC1_max_X, y = PC1_max_Y), shape = 3, size = 3, stroke = 2, colour = "#E69F00") +
geom_point(data = estimate_PC1, aes(x = PC1_min_X, y = PC1_min_Y), shape = "\u2212", size = 5, stroke = 8, colour = "#56B4E9") +
labs(x = "X", y = "Y", title = "PC1")
# PC2
PC2_reconstructed <- ggplot() +
geom_line(data = estimate_PC2, aes(x = mean_X, y = mean_Y), linewidth = 1.5) +
geom_line(data = estimate_PC2, aes(x = PC2_max_X, y = PC2_max_Y), linetype = "dashed", linewidth = 1, alpha = 0.5, colour = "#E69F00") +
geom_line(data = estimate_PC2, aes(x = PC2_min_X, y = PC2_min_Y), linetype = "dotted", linewidth = 1, alpha = 0.5, colour = "#56B4E9") +
geom_point(data = estimate_PC2, aes(x = PC2_max_X, y = PC2_max_Y), shape = 3, size = 3, stroke = 2, colour = "#E69F00") +
geom_point(data = estimate_PC2, aes(x = PC2_min_X, y = PC2_min_Y), shape = "\u2212", size = 5, stroke = 8, colour = "#56B4E9") +
labs(x = "X", y = "Y", title = "PC2")
# showing plots
ggpubr::ggarrange(PC1_reconstructed, PC2_reconstructed, nrow = 1)
So, this is quite similar to Figure 1 of the ICPhS paper: PC1 captures the front-back movement of the tongue, with larger PC1 values indicating more tongue fronting. PC2 captures the high-low dimension, with smaller PC2 values indicating a higher tongue position.
Dynamic changes in PC scores
Since the input to the PCA analysis here involves temporal information (i.e., tongue shapes extracted at 11 temporal points during the vowel-liquid-vowel interval), it is possible to infer dynamic changes in tongue shape for each production token. We can do so by tracking changes in PC scores for each PC dimension.
Extracting PC scores
Let’s extract PC scores associated with the PC1 and PC2 dimensions.
PC scores are stored in pca_results$x.
## List of 5
## $ sdev : num [1:22] 21.63 13.51 7.86 5.87 5.25 ...
## $ rotation: num [1:22, 1:22] 0.2322 0.0762 0.3381 0.0492 0.35 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:22] "X_1" "Y_1" "X_2" "Y_2" ...
## .. ..$ : chr [1:22] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:22] -13.9 -30.81 -11.54 -18.43 -8.37 ...
## ..- attr(*, "names")= chr [1:22] "X_1" "Y_1" "X_2" "Y_2" ...
## $ scale : logi FALSE
## $ x : num [1:2981, 1:22] 5.59 1.78 -4.71 -5.83 -1.87 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:22] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
# Get the results of the PCA which are useful
pc_score <- pca_results$x
# Put it into a sensible format as some variables come out weird
pc_score <- as_tibble(pc_score) |>
dplyr::select(1:4)
# Combine with non-numeric information from earlier
dat_pc <- cbind(dat, pc_score)
# normalise PCs by speaker for comparison
pca_result <- dat_pc |>
dplyr::group_by(speaker) |>
dplyr::mutate(
PC1z = scale(PC1),
PC2z = scale(PC2),
PC3z = scale(PC3),
PC4z = scale(PC4)
) |>
dplyr::ungroup()
Data wrangling
We’ll transform the data into a long format so that we’ll be able to
handle it better for data visualisation using ggplot2.
# converting data frame into long format (and parsing rec_date)
pca_result_long <- pca_result |>
tidyr::pivot_longer(5:26, names_to = c(".value", "point_number"), names_sep = "_") |>
dplyr::mutate(
rec_date = lubridate::parse_date_time(rec_date, c("Ymd HMS","dmY HM"))
)
# integrate meta data by joining pca_result_long to dat_long
dat_long <- dplyr::left_join(dat_long, pca_result_long, by = c("speaker", "rec_date", "prompt", "interval", "prop_time", "perc_time", "point_number", "X", "Y"))
# add L1 information
dat_long <- dat_long |>
dplyr::mutate(
L1 = case_when(
speaker %in% c("4ps8zx", "5jzj2h", "5upwe3", "6p63jy", "cu2jce", "ds6umh", "h5s4x3", "jcy8xi", "m46dhf", "tay55n", "we8z58", "xub9bc") ~ "English",
TRUE ~ "Japanese"
)
)Data visualisation
Let’s visualise the dynamic changes in PC scores. This is similar to Figure 2, but I’ve added the raw trajectories as well.
To make the interpretation of PC2 more intuitive, I have flipped the sign of the PC2z scores so that positive PC2 values indicate tongue raising.
PC1 (front-back)
# PC1
dat_long |>
ggplot(aes(x = prop_time, y = PC1z)) +
geom_line(aes(group = token_id, colour = prompt), alpha = 0.4) +
geom_smooth(aes(group = prompt), colour = "white", linewidth = 3.0, alpha = 0.4) +
geom_smooth(aes(group = prompt, colour = prompt), linewidth = 1.5, alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, alpha = 0.4) +
annotate("text", x = 1.5, y = -3, label = "← Retraction", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
annotate("text", x = 1.5, y = 4, label = "Fronting →", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
scale_colour_manual(values = cbPalette) +
labs(x = "Proportional time (%)", y = "PC1 (z-normalised)", title = "Dynamic changes in PC1") +
facet_wrap(~ L1)
PC2 (high-low)
# PC2
dat_long |>
# note the sign inversion for PC2z
ggplot(aes(x = prop_time, y = -PC2z)) +
geom_line(aes(group = token_id, colour = prompt), alpha = 0.4) +
geom_smooth(aes(group = prompt), colour = "white", linewidth = 3.0, alpha = 0.4) +
geom_smooth(aes(group = prompt, colour = prompt), linewidth = 1.5, alpha = 0.4) +
annotate("text", x = 1.5, y = -1.5, label = "← Lowering", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
annotate("text", x = 1.5, y = 3, label = "Raising →", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
labs(x = "Proportional time (%)", y = "PC2 (z-normalised; inverted)", title = "Dynamic changes in PC2 (inverted)") +
facet_wrap(~ L1)
Statistical analysis
My ICPhS paper didn’t go beyond data visualisation. But we tried both FPCA and GAMMs in the statistics workshop this week, so why don’t we try them? I’ll try running stats on PC1.
The PCA analysis above compresses the movement of the 11 DLC knots into a single PC score per tongue contour at each time point along each PC dimension. This means that we no longer need the original data structure - in fact, keeping it would cause errors in the non-linear statistics below. So let’s subset the data set so that each row corresponds to one PC score at one time point.
dat_sub <- dat_long |>
dplyr::group_by(speaker, L1, rec_date, prompt, interval, prop_time, repetition, token_id) |>
# taking mean() but this shouldn't change the values because PC values are the same for all 11 DLC knots
dplyr::summarise(PC1z = mean(PC1z),
PC2z = mean(PC2z)) |>
dplyr::ungroup()
We’ll also convert prop_time into numeric values.
# convert the factor via its labels (not its level indices)
dat_sub$prop_time <- as.numeric(as.character(dat_sub$prop_time))
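As a sanity check on the collapsing logic, here is the same idea in base R on a toy frame (not the real data): since the PC scores repeat across the 11 knots within a spline, dropping the knot-level column and keeping unique rows is equivalent to the grouped summarise() above.

```r
# toy frame: 2 time points x 3 knots, PC score constant within time point
toy <- data.frame(token_id  = "t1",
                  prop_time = rep(c(0, 10), each = 3),
                  knot      = rep(1:3, times = 2),
                  PC1z      = rep(c(0.5, -0.2), each = 3))
toy$knot <- NULL            # drop the knot-level column first
collapsed <- unique(toy)    # one row per (token, time point)
nrow(collapsed)  # 2
```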
Even the subset data frame should enable the same data visualisation – just as a sanity check:
dat_sub |>
ggplot(aes(x = prop_time, y = PC1z)) +
geom_line(aes(group = token_id, colour = prompt), alpha = 0.4) +
geom_smooth(aes(group = prompt), colour = "white", linewidth = 3.0, alpha = 0.4) +
geom_smooth(aes(group = prompt, colour = prompt), linewidth = 1.5, alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, alpha = 0.4) +
annotate("text", x = 1.5, y = -3, label = "← Retraction", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
annotate("text", x = 1.5, y = 4, label = "Fronting →", hjust = 1, vjust = 0, size = 3, angle = 90, fontface = "bold") +
scale_colour_manual(values = cbPalette) +
labs(x = "Proportional time (%)", y = "PC1 (z-normalised)", title = "Dynamic changes in PC1") +
facet_wrap(~ L1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Looks good, let’s move onto FPCA!
FPCA
Creating FPCA input and run FPCA
As we saw earlier in the workshop, we first create FPCA input using
fdapace::MakeFPCAInputs(), and then check if the input has
an appropriate structure using fdapace::CheckData(). The
checking process would spit errors if prop_time wasn’t
numeric.
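For intuition, the input object is just two parallel lists with one element per token: Ly holds the measurements and Lt the matching time stamps. A hand-built toy sketch of that structure (hypothetical values, not the real data):

```r
# Ly = measurements per token, Lt = matching time stamps per token
input_sketch <- list(
  Ly = list(tok1 = c(-0.2, 0.1, 0.4), tok2 = c(0.3, 0.0, -0.1)),
  Lt = list(tok1 = c(0, 50, 100),     tok2 = c(0, 50, 100))
)
# every Ly element must line up with its Lt element
lengths(input_sketch$Ly) == lengths(input_sketch$Lt)
```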
# IDs = token column; tVec = time column; yVec = variable column(s)
input_dat <- fdapace::MakeFPCAInputs(IDs = dat_sub$token_id, tVec = dat_sub$prop_time, yVec = dat_sub$PC1z)
# Check if there's any issues with the data
fdapace::CheckData(input_dat$Ly, input_dat$Lt)
# No errors have been returned, so let's now run FPCA
dat_dyn_fpca <- fdapace::FPCA(Ly = input_dat$Ly, Lt = input_dat$Lt, optns = list(plot = TRUE))
The mean function seems to capture the overall pattern, in which the tongue achieves maximal retraction earlier in the interval and then fronts towards the vowel /i/.
Let’s quickly check what proportion of variance is captured by each FPC.
## [1] 0.5247190 0.8005069 0.9431233 0.9846224 0.9979597
FPC1 captures 52.47% of the variance observed in the PC1 trajectories, followed by FPC2 (27.58%) and FPC3 (14.26%). The first three FPCs jointly explain 94.31% of the variance.
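The printed values are cumulative proportions, i.e. cumsum(lambda)/sum(lambda) over the eigenvalues; the per-FPC shares quoted above are their first differences. A quick sketch with hypothetical eigenvalues standing in for dat_dyn_fpca$lambda:

```r
lambda <- c(5.25, 2.76, 1.43, 0.41, 0.13)   # hypothetical eigenvalues
cum_prop <- cumsum(lambda) / sum(lambda)    # what FPCA reports
per_fpc  <- diff(c(0, cum_prop))            # share of each single FPC
# the differences recover lambda / sum(lambda)
all.equal(per_fpc, lambda / sum(lambda))
```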
Understanding FPCs
We’ll skip the intermediate checking stages and jump straight onto visualising the variation decomposed by each FPC.
# function: get PC scores + return data frame with PCs for each token
get_pc_scores <- function(fpcaObj){
pcs <- data.frame(fpcaObj$xiEst)
token <- names(fpcaObj$inputData$Lt)
df <- cbind(token, pcs)
n_pcs <- length(fpcaObj$lambda) # get number of PCs
pc_names <- paste0("PC", 1:n_pcs) # create colnames for PCs
names(df) <- c("token_id", pc_names) # add colnames for token + PCs
return(df)
}
# get PC scores w/ token info
pc_df <- get_pc_scores(dat_dyn_fpca)
# join PCs (dat) with selected cols from original data frame
## store meta info
meta <- dat_sub |>
dplyr::select(speaker, L1, rec_date, prompt, interval, prop_time, repetition, token_id)
## merge the list and meta data - unique(meta) because otherwise there would be lots of duplicates
dat <- left_join(pc_df, unique(meta), by = "token_id")
# function: define perturbation function (±Q = ±sd, k = PC number)
perturbation <- function(fpcaObj, Q, k){
Q * sqrt(fpcaObj$lambda[k]) * fpcaObj$phi[,k] + fpcaObj$mu
}
# function: create perturbation object with mean and ±Q sd as a data frame (for one PC only)
# can validate against fdapace::GetMeanCurve and fdapace::CreateModeOfVarPlot
perturbation_object <- function(fpcaObj, Q, k){
time <- fpcaObj$workGrid # grid of time values
mean <- fpcaObj$mu # mean trajectory
Qplus <- perturbation(fpcaObj, Q, k) # +Q sd
Qminus <- perturbation(fpcaObj, -Q, k) # -Q sd
df <- cbind(time, mean, Qplus, Qminus)
colnames(df) <- c("time", "mean", "Qplus", "Qminus")
df <- data.frame(df)
df$PC <- paste0("PC", k) # add PC colname
return(df)
}
# function: create perturbation data frame with mean and ±Q sd (for all PCs)
# to do: add ability to pass list of Q values for gradient perturbation function
get_perturbation <- function(fpcaObj, Q){
n_pcs <- length(fpcaObj$lambda)
k <- 1:n_pcs
df <- lapply(k, perturbation_object, fpcaObj=fpcaObj, Q=Q)
df <- dplyr::bind_rows(df) # unnest lists into single df
return(df)
}
# get mean trajectory and ±2 sd for all PCs
Q <- seq(from = -2, to = 2, by = 0.1)
pQ <- lapply(Q, get_perturbation, fpcaObj = dat_dyn_fpca) # change fpcaObj appropriately
names(pQ) <- Q # name pQ lists using value of Q
pQ <- dplyr::bind_rows(pQ, .id = "Q") # collapse lists together
pQ$Q <- as.numeric(pQ$Q) # make 'Q' column numeric
# visualise variation along each FPC with mean function
pQ <- pQ |>
dplyr::filter(PC %in% c('PC1','PC2', 'PC3')) |>
dplyr::mutate(
PC = case_when(
PC == "PC1" ~ "FPC1",
PC == "PC2" ~ "FPC2",
PC == "PC3" ~ "FPC3"
)
) |>
tidyr::pivot_longer(mean:Qminus, names_to = "Qsd", values_to = "value")
pQ |>
ggplot2::ggplot() +
aes(x = time, y = value, colour = Q, group = interaction(Q, Qsd)) +
geom_path() +
geom_line(data = pQ |> dplyr::filter(Qsd == "mean"), aes(x = time, y = value), colour = "black", linewidth = 1) +
facet_wrap(~ PC, ncol = 2) +
scale_colour_gradient2(low = "#E69F00", mid = "#56B4E9", high = "#009E73", midpoint = 0)+
labs(x = "Time (normalised)", y = "Reconstructed PC1 trajectories", color = "FPC score")
Reconstructing original trajectories
Let’s reconstruct the first three eigenfunctions.
# mean fPC1 trajectory
# pc1_mean_curve <- fdapace::GetMeanCurve(Ly = input.PC1$Ly, Lt = input.PC1$Lt, optns = list(plot = TRUE))
mu_values <- data.frame(dat_dyn_fpca$mu) # mean curve values
mu_time <- data.frame(dat_dyn_fpca$workGrid) # timepoints used for estimating the curve
phi <- data.frame(dat_dyn_fpca$phi) # eigenfunction at each timepoint: workGrid * nlambda (e.g., 255 = 51 workGrid * 5 lambda)
lambda <- data.frame(dat_dyn_fpca$lambda) # PC loadings for each PC: currently 5
# create a data frame containing mean curve, time and eigenfunctions associated with each PC at each time point
## add an extra column 'col_number' as a common index across the data frames - useful when merging everything together later on
### mean curve
mu_values <- mu_values |>
dplyr::mutate(
col_number = row_number()
)
### sampling time points
mu_time <- mu_time |>
dplyr::mutate(
col_number = row_number()
)
### eigenfunction
phi <- phi |>
dplyr::mutate(
col_number = row_number()
)
### pc loadings
lambda <- lambda |>
dplyr::mutate(
PC = str_c("PC", row_number()),
PC = str_c(PC, "lambda", sep = "_")
) |>
tidyr::pivot_wider(names_from = "PC", values_from = "dat_dyn_fpca.lambda") |>
dplyr::slice(rep(1:n(), each = 11)) |>
dplyr::mutate(
col_number = row_number()
)
## merging all data together one by one
rec <- dplyr::left_join(mu_values, mu_time, by = "col_number")
rec <- dplyr::left_join(rec, phi, by = "col_number")
rec <- dplyr::left_join(rec, lambda, by = "col_number")
## tidying up some column names
rec <- rec |>
dplyr::select(col_number, dat_dyn_fpca.workGrid, dat_dyn_fpca.mu, X1, X2, X3, X4, PC1_lambda, PC2_lambda, PC3_lambda, PC4_lambda) |>
dplyr::rename(
mean = dat_dyn_fpca.mu,
time = dat_dyn_fpca.workGrid,
PC1_eigen = X1,
PC2_eigen = X2,
PC3_eigen = X3,
PC4_eigen = X4
)
## plotting the eigenfunctions - this should match with a sub-plot in bottom right created with plot(PC1)
rec |>
ggplot() +
# geom_path(aes(x = time, y = mean)) +
geom_path(aes(x = time, y = PC1_eigen), colour = "black", linewidth = 1.5) +
geom_path(aes(x = time, y = PC2_eigen), colour = "red", linetype = 2, linewidth = 1.5) +
geom_path(aes(x = time, y = PC3_eigen), colour = "darkgreen", linetype = 3, linewidth = 1.5) +
# geom_path(aes(x = time, y = value, colour = pc)) +
geom_hline(yintercept = 0) +
labs(x = "time", y = "eigenfunctions", title = "First 3 eigenfunctions")Extracting FPC scores
We’ll also extract FPC scores and associate them with the meta data.
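The code that produced the two outputs below isn't shown; assuming standard fdapace usage, the first is the head of the score matrix stored in the fitted object (xiEst: one row per trajectory token, one column per FPC), and the second is the head of the merged data frame dat created earlier — a sketch:

```r
# FPC scores straight from the FPCA fit: one row per token, one column per FPC
head(dat_dyn_fpca$xiEst)
# scores joined with the meta data (the data frame built in the previous section)
head(dat)
```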
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2105162 -1.23280330 -1.0410506 -0.2272518 -0.68786042
## [2,] -0.4783249 -1.06785381 -0.8186685 -0.4357777 -0.13441139
## [3,] -0.7786399 -0.53445253 -0.9730875 -0.4848629 -0.06312803
## [4,] 0.9797088 0.07333196 0.3737915 -0.9525886 -0.14125812
## [5,] 1.0910414 -1.34679083 0.4000563 -0.8101326 -0.08346305
## [6,] 1.8837205 2.16659020 0.9508972 0.1795540 -0.14285407
## token_id PC1 PC2 PC3 PC4 PC5 speaker
## 1 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## 2 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## 3 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## 4 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## 5 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## 6 2zy9tf_biribiri_1 0.2105162 -1.232803 -1.041051 -0.2272518 -0.6878604 2zy9tf
## L1 rec_date prompt interval prop_time repetition
## 1 Japanese 2022-10-20 13:59:31 biribiri idi 1 1
## 2 Japanese 2022-10-20 13:59:31 biribiri idi 2 1
## 3 Japanese 2022-10-20 13:59:31 biribiri idi 3 1
## 4 Japanese 2022-10-20 13:59:31 biribiri idi 4 1
## 5 Japanese 2022-10-20 13:59:31 biribiri idi 5 1
## 6 Japanese 2022-10-20 13:59:31 biribiri idi 6 1
# duplicate each row 11 times
dat_time <- dat |>
dplyr::slice(rep(1:n(), each = 11))
# add col_names to merge with the other data frame
dat_time <- dat_time |>
dplyr::group_by(token_id) |>
dplyr::mutate(
col_number = row_number()
) |>
ungroup()
# merge
dat_time <- left_join(dat_time, rec, by = "col_number") |>
na.omit()
Data visualisation
### FPC1
# reconstruct individual trajectory tokens
dat_time <- dat_time |>
dplyr::mutate(
PC1_reconstruct = PC1 * PC1_eigen + mean,
PC2_reconstruct = PC2 * PC2_eigen + mean,
PC3_reconstruct = PC3 * PC3_eigen + mean,
PC4_reconstruct = PC4 * PC4_eigen + mean
)
# visualise FPC1
dat_time |>
ggplot() +
geom_path(aes(x = time, y = PC1_reconstruct, group = token_id, colour = prompt), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "Reconstructed PC1", title = "Reconstructed PC1 trajectories from FPC1") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_wrap(~ L1)
### FPC2
# visualise FPC2
dat_time |>
ggplot() +
geom_path(aes(x = time, y = PC2_reconstruct, group = token_id, colour = prompt), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "Reconstructed PC1", title = "Reconstructed PC1 trajectories from FPC2") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_wrap(~ L1)
FPC3
# visualise FPC3
dat_time |>
ggplot() +
geom_path(aes(x = time, y = PC3_reconstruct, group = token_id, colour = prompt), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "Reconstructed PC1", title = "Reconstructed PC1 trajectories from FPC3") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_wrap(~ L1)
Synthesise FPC1, FPC2 and FPC3
OK, so we’ve managed to characterise between-group differences along the first three FPC dimensions. However, a big downside of this PCA+FPCA approach is that it places the data in a highly abstract space: both PCA and FPCA are data-abstraction techniques, so we have applied one layer of abstraction on top of another.
One solution is to reconstruct the PC trajectories from multiple FPC scores. After all, in articulatory analysis we are less interested in identifying the modes of variation as such – rather, we use FPCA primarily as a shape-to-number converter so that we can test between-group differences statistically later on.
Remember that approximately 95% of the variance in the data is accounted for by the first three FPCs. This means that, when we sum their contributions, we minimise data loss: the resulting reconstructed trajectories explain almost all the variance seen in the data.
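As a sanity check on that 95% figure, the fraction of variance explained can be read off the fitted object — a minimal sketch, assuming dat_dyn_fpca is the FPCA fit used throughout:

```r
# cumulative fraction of variance explained, as stored by fdapace
dat_dyn_fpca$cumFVE
# equivalently, from the eigenvalues
cumsum(dat_dyn_fpca$lambda) / sum(dat_dyn_fpca$lambda)
```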
# reconstruct individual trajectory tokens
dat_time <- dat_time |>
dplyr::mutate(
joint_PC1_PC2_PC3 = (PC1 * PC1_eigen) + (PC2 * PC2_eigen) + (PC3 * PC3_eigen) + mean
)
# visualise FPC1+FPC2+FPC3
dat_time |>
ggplot() +
geom_path(aes(x = time, y = joint_PC1_PC2_PC3, group = token_id, colour = prompt), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "Reconstructed PC1", title = "Reconstructed PC1 trajectories from joint FPC1+2+3") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_wrap(~ L1)
The resulting trajectories are rather jagged – precisely because using multiple FPCs picks up the smaller variability along each PC1 trajectory token. At the same time, the trajectories still look smoother than the original PC1 trajectories – see the comparison below.
# compare with original PC1 trajectories
ggpubr::ggarrange(PC1_viz, joint_FPCs_viz, common.legend = TRUE, legend = "bottom")
Finally, here’s the visualisation of the joint FPC scores. The data points look quite sparse, but that is because FPCA converts each entire trajectory into a single number.
## [1] "token_id" "PC1" "PC2"
## [4] "PC3" "PC4" "PC5"
## [7] "speaker" "L1" "rec_date"
## [10] "prompt" "interval" "prop_time"
## [13] "repetition" "col_number" "time"
## [16] "mean" "PC1_eigen" "PC2_eigen"
## [19] "PC3_eigen" "PC4_eigen" "PC1_lambda"
## [22] "PC2_lambda" "PC3_lambda" "PC4_lambda"
## [25] "PC1_reconstruct" "PC2_reconstruct" "PC3_reconstruct"
## [28] "PC4_reconstruct" "joint_PC1_PC2_PC3"
dat_time <- dat_time |>
dplyr::mutate(
joint_FPCs = PC1 + PC2 + PC3
) |>
dplyr::group_by(speaker, L1, prompt, interval, rec_date, repetition, token_id) |>
dplyr::summarise(
joint_FPCs = unique(joint_FPCs)
) |>
dplyr::ungroup()
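The plot below uses cbPalette, which is defined earlier in the course materials; in case you are running this section on its own, a matching definition — an assumption based on the hex codes used in the manual scales above, i.e. the Okabe–Ito colour-blind-friendly palette — would be:

```r
# Okabe-Ito colour-blind-friendly palette (matches the manual scales used above)
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
```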
dat_time |>
ggplot(aes(x = L1, y = joint_FPCs)) +
geom_jitter(aes(colour = L1), alpha = 0.4) +
geom_violin(aes(fill = L1), alpha = 0.4) +
geom_boxplot(width = 0.3) +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
scale_colour_manual(values = cbPalette) +
scale_fill_manual(values = cbPalette) +
labs(x = "L1", y = "FPC1 + FPC2 + FPC3", title = "joint FPC values") +
facet_wrap( ~ prompt)
Statistical analysis
Now we see that the joint FPC1+2+3 values capture the overall trend in the PC1 trajectories quite well, so we can move on to statistical analysis. We can simply run linear mixed-effects models here.
Preparing for LMEs
# converting variables into factor
dat_time <- dat_time |>
dplyr::mutate(
L1 = as.factor(L1),
speaker = as.factor(speaker),
prompt = as.factor(prompt)
)
# adding "segment" information
dat_time <- dat_time |>
dplyr::mutate(
segment = case_when(
interval == "ili" ~ "/l/",
interval == "iri" ~ "/ɹ/",
interval == "idi" ~ "/ɾ/"
),
segment = as.factor(segment)
)
# subset only English words
dat_time_en <- dat_time |>
dplyr::filter(
prompt %in% c("believe", "bereave")
) |>
dplyr::mutate(
prompt = droplevels(as.factor(prompt)),
segment = droplevels(as.factor(segment))
)
Running a full model
# running model
m1 <- lme4::lmer(joint_FPCs ~ L1 + segment + L1:segment + (1|speaker), data = dat_time_en)
# summary
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: joint_FPCs ~ L1 + segment + L1:segment + (1 | speaker)
## Data: dat_time_en
##
## REML criterion at convergence: 939
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3616 -0.5167 0.0165 0.5169 3.2661
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker (Intercept) 1.656 1.287
## Residual 3.199 1.789
## Number of obs: 224, groups: speaker, 29
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.1123 0.4484 2.481
## L1Japanese -1.5152 0.5949 -2.547
## segment/ɹ/ 0.4139 0.3438 1.204
## L1Japanese:segment/ɹ/ 1.0109 0.4801 2.106
##
## Correlation of Fixed Effects:
## (Intr) L1Jpns sgmn//
## L1Japanese -0.754
## segment/ɹ/ -0.403 0.304
## L1Jpns:sg// 0.289 -0.408 -0.716
Model comparison for interaction L1:segment
# nested model to test interaction
m2 <- lme4::lmer(joint_FPCs ~ L1 + segment + (1|speaker), data = dat_time_en)
# model comparison
anova(m1, m2, test = "Chisq")
## refitting model(s) with ML (instead of REML)
## Data: dat_time_en
## Models:
## m2: joint_FPCs ~ L1 + segment + (1 | speaker)
## m1: joint_FPCs ~ L1 + segment + L1:segment + (1 | speaker)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m2 5 952.52 969.58 -471.26 942.52
## m1 6 950.10 970.57 -469.05 938.10 4.4213 1 0.03549 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
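The emmeans call that produced the output below isn't shown in this chunk; it was presumably something along these lines (the exact arguments are an assumption — Kenward–Roger is the emmeans default for lmer models):

```r
# estimated marginal means for L1 within each segment, with pairwise contrasts
emmeans::emmeans(m1, pairwise ~ L1 | segment, lmer.df = "kenward-roger")
```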
## $emmeans
## segment = /l/:
## L1 emmean SE df lower.CL upper.CL
## English 1.112 0.448 35.6 0.202 2.022
## Japanese -0.403 0.391 41.0 -1.193 0.387
##
## segment = /ɹ/:
## L1 emmean SE df lower.CL upper.CL
## English 1.526 0.441 33.5 0.629 2.424
## Japanese 1.022 0.396 42.7 0.224 1.820
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
## segment = /l/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 1.515 0.595 37.8 2.547 0.0151
##
## segment = /ɹ/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 0.504 0.593 37.2 0.851 0.4004
##
## Degrees-of-freedom method: kenward-roger
GAMMs
We can also model the PC1 trajectories directly using GAMMs, as opposed to the FPCA approach, which almost always needs to be combined with additional statistical tests.
We’ll build on the data frame dat_sub, from which the DLC knot information has already been removed (as it’s irrelevant for our analysis).
Defining variables
Let’s define the variables of interest. We’ll build a model with
LangSeg for the interaction between L1 and
segment.
## [1] "speaker" "L1" "rec_date" "prompt" "interval"
## [6] "prop_time" "repetition" "token_id" "PC1z" "PC2z"
# add 'segment', convert variables into factor and create an interaction variable
dat_sub <- dat_sub |>
dplyr::mutate(
segment = case_when(
interval == "ili" ~ "/l/",
interval == "iri" ~ "/ɹ/",
interval == "idi" ~ "/ɾ/"
),
segment = as.factor(segment),
L1 = as.factor(L1),
speaker = as.factor(speaker),
prompt = as.factor(prompt),
LangSeg = interaction(L1, segment)
)
# also make sure that prop_time is numeric
dat_sub <- dat_sub |>
dplyr::mutate(
prop_time = as.numeric(prop_time)
)
Ordered variables
A key preparatory step before fitting a GAMM is to convert the grouping variable into an ordered factor so that we can fit difference smooths.
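The conversion itself isn't shown above; following the usual itsadug/mgcv recipe (a sketch — the choice of reference level follows the default factor ordering, which is an assumption), it would look like this:

```r
# convert LangSeg into an ordered factor with treatment contrasts,
# so that the by-smooths are estimated as differences from the reference level
dat_sub$LangSeg.ord <- as.ordered(dat_sub$LangSeg)
contrasts(dat_sub$LangSeg.ord) <- "contr.treatment"
```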
Fitting the first model
# fitting a model
m1 <- mgcv::bam(PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) + s(prop_time, speaker, bs = "fs", m = 1), data = dat_sub, method = "fREML")
# summary
summary(m1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) +
## s(prop_time, speaker, bs = "fs", m = 1)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10695 0.02904 3.682 0.000235 ***
## LangSeg.ordJapanese./l/ -0.12517 0.03984 -3.142 0.001695 **
## LangSeg.ordEnglish./ɹ/ -0.20452 0.04016 -5.092 3.77e-07 ***
## LangSeg.ordJapanese./ɹ/ -0.37558 0.03984 -9.427 < 2e-16 ***
## LangSeg.ordJapanese./ɾ/ 0.27767 0.04290 6.472 1.13e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(prop_time) 7.439 8.198 34.910 < 2e-16 ***
## s(prop_time):LangSeg.ordJapanese./l/ 3.363 4.021 8.341 1.42e-06 ***
## s(prop_time):LangSeg.ordEnglish./ɹ/ 1.344 1.602 0.808 0.551
## s(prop_time):LangSeg.ordJapanese./ɹ/ 5.650 6.712 12.839 < 2e-16 ***
## s(prop_time):LangSeg.ordJapanese./ɾ/ 5.860 6.952 17.177 < 2e-16 ***
## s(prop_time,speaker) 93.822 259.000 1.511 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.513 Deviance explained = 53.3%
## fREML = 3272.7 Scale est. = 0.48254 n = 2981
Autocorrelation correction
# mark start.event
## arrange the data appropriately
dat_sub <- dat_sub |>
dplyr::arrange(token_id, prop_time)
## mark time = 1 (as we express prop_time between 1 and 11)
dat_sub$start.event <- dat_sub$prop_time == 1
# check residual autocorrelations
itsadug::acf_resid(m1)
Fitting a model with AR(1) model
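The rho value used below (rho_m1) is not defined in the chunks above; assuming it follows the standard itsadug workflow, it can be estimated from the lag-1 residual autocorrelation of m1:

```r
# estimate the AR(1) coefficient from m1's residual autocorrelation at lag 1
rho_m1 <- itsadug::start_value_rho(m1)
```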
# fitting a model
m2 <- mgcv::bam(PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) + s(prop_time, speaker, bs = "fs", m = 1), data = dat_sub, method = "fREML", rho = rho_m1, AR.start = dat_sub$start.event)
# summary
summary(m2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) +
## s(prop_time, speaker, bs = "fs", m = 1)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09830 0.05725 1.717 0.086121 .
## LangSeg.ordJapanese./l/ -0.11746 0.07926 -1.482 0.138459
## LangSeg.ordEnglish./ɹ/ -0.18690 0.07765 -2.407 0.016153 *
## LangSeg.ordJapanese./ɹ/ -0.36720 0.07931 -4.630 3.82e-06 ***
## LangSeg.ordJapanese./ɾ/ 0.28673 0.08554 3.352 0.000813 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(prop_time) 8.242 8.496 31.519 < 2e-16 ***
## s(prop_time):LangSeg.ordJapanese./l/ 4.957 5.805 4.769 0.000112 ***
## s(prop_time):LangSeg.ordEnglish./ɹ/ 1.000 1.000 0.169 0.681334
## s(prop_time):LangSeg.ordJapanese./ɹ/ 7.596 8.337 10.914 < 2e-16 ***
## s(prop_time):LangSeg.ordJapanese./ɾ/ 7.460 8.288 14.084 < 2e-16 ***
## s(prop_time,speaker) 151.405 259.000 2.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.509 Deviance explained = 53.9%
## fREML = 1883.1 Scale est. = 0.39258 n = 2981
Model comparison
Shape + height difference
Let’s test the overall effect of LangSeg.ord on the
PC1z trajectory pattern.
# full model with ML
m2 <- mgcv::bam(PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) + s(prop_time, speaker, bs = "fs", m = 1), data = dat_sub, method = "ML", rho = rho_m1, AR.start = dat_sub$start.event)
# nested model without parametric or smooth terms associated with LangSeg
m3 <- mgcv::bam(PC1z ~ s(prop_time), data = dat_sub, method = "ML", rho = rho_m1, AR.start = dat_sub$start.event)
# model comparison
itsadug::compareML(m2, m3, suggest.report = TRUE)
## m2: PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) +
## s(prop_time, speaker, bs = "fs", m = 1)
##
## m3: PC1z ~ s(prop_time)
##
## Report suggestion: The Chi-Square test on the ML scores indicates that model m2 is [significantly?] better than model m3 (X2(14.00)=322.488, p<2e-16).
## -----
## Model Score Edf Difference Df p.value Sig.
## 1 m3 2190.595 3
## 2 m2 1868.106 17 322.488 14.000 < 2e-16 ***
##
## AIC difference: -793.96, model m2 has lower AIC.
Shape difference
We’ll now test whether there is a shape difference between trajectories.
# full model with ML
m2 <- mgcv::bam(PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) + s(prop_time, speaker, bs = "fs", m = 1), data = dat_sub, method = "ML", rho = rho_m1, AR.start = dat_sub$start.event)
# nested model only with smooth term for LangSeg
m4 <- mgcv::bam(PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, speaker, bs = "fs", m = 1), data = dat_sub, method = "ML", rho = rho_m1, AR.start = dat_sub$start.event)
# model comparison
itsadug::compareML(m2, m4, suggest.report = TRUE)
## m2: PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, by = LangSeg.ord) +
## s(prop_time, speaker, bs = "fs", m = 1)
##
## m4: PC1z ~ LangSeg.ord + s(prop_time) + s(prop_time, speaker, bs = "fs",
## m = 1)
##
## Report suggestion: The Chi-Square test on the ML scores indicates that model m2 is [significantly?] better than model m4 (X2(8.00)=108.434, p<2e-16).
## -----
## Model Score Edf Difference Df p.value Sig.
## 1 m4 1976.540 9
## 2 m2 1868.106 17 108.434 8.000 < 2e-16 ***
##
## AIC difference: -217.44, model m2 has lower AIC.
Data visualisation
separate smooths using itsadug
Let’s check whether the PC1 trajectories are modelled accurately. We’ll first inspect the separate smooths for each liquid segment, which suggest that the model captures the trajectories well.
# separate smooths 1
## English
plot_smooth(m2, view = "prop_time", plot_all = "LangSeg.ord", cond = list(LangSeg.ord = c("English./l/", "English./ɹ/")))
## Summary:
## * LangSeg.ord : factor; set to the value(s): English./l/, English./ɹ/.
## * prop_time : numeric predictor; with 30 values ranging from 1.000000 to 11.000000.
## * speaker : factor; set to the value(s): f9japd. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(prop_time,speaker)
##
## Japanese
plot_smooth(m2, view = "prop_time", plot_all = "LangSeg.ord", cond = list(LangSeg.ord = c("Japanese./l/", "Japanese./ɹ/", "Japanese./ɾ/")))
## Summary:
## * LangSeg.ord : factor; set to the value(s): Japanese./l/, Japanese./ɹ/, Japanese./ɾ/.
## * prop_time : numeric predictor; with 30 values ranging from 1.000000 to 11.000000.
## * speaker : factor; set to the value(s): f9japd. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(prop_time,speaker)
##
For comparison, here are the raw PC1 trajectories:
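The comparison plot itself isn't included in the code above; a minimal sketch of the raw trajectories, assuming dat_sub holds the by-timepoint PC1z values, might be:

```r
# raw PC1z trajectories, one line per token, faceted by L1
dat_sub |>
  ggplot(aes(x = prop_time, y = PC1z, group = token_id, colour = segment)) +
  geom_path(alpha = 0.2) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(x = "Proportional time", y = "PC1 (z-scored)") +
  facet_wrap(~ L1)
```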
separate and difference smooths using itsadug
Now let’s check how L1 English and L1 Japanese speakers differ in the patterns of their PC1 trajectories. We’ll visualise separate by-group smooths for each liquid segment first, followed by difference smooths encoding the by-group differences.
# English /l/
## separate smooths
plot_smooth(m2, view = "prop_time", plot_all = "LangSeg.ord", cond = list(LangSeg.ord = c("English./l/", "Japanese./l/")))
## Summary:
## * LangSeg.ord : factor; set to the value(s): English./l/, Japanese./l/.
## * prop_time : numeric predictor; with 30 values ranging from 1.000000 to 11.000000.
## * speaker : factor; set to the value(s): f9japd. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(prop_time,speaker)
##
## difference smooth
plot_diff(m2, view = "prop_time", comp = list(LangSeg.ord = c("English./l/", "Japanese./l/")), print.summary = FALSE, main = "Japanese-English difference for /l/")
# English /ɹ/
## separate smooths
plot_smooth(m2, view = "prop_time", plot_all = "LangSeg.ord", cond = list(LangSeg.ord = c("English./ɹ/", "Japanese./ɹ/")))
## Summary:
## * LangSeg.ord : factor; set to the value(s): English./ɹ/, Japanese./ɹ/.
## * prop_time : numeric predictor; with 30 values ranging from 1.000000 to 11.000000.
## * speaker : factor; set to the value(s): f9japd. (Might be canceled as random effect, check below.)
## * NOTE : The following random effects columns are canceled: s(prop_time,speaker)
##
## difference smooth
plot_diff(m2, view = "prop_time", comp = list(LangSeg.ord = c("English./ɹ/", "Japanese./ɹ/")), print.summary = FALSE, main = "Japanese-English difference for /ɹ/")
tidygam approach
We can also visualise the GAMMs results more neatly using
tidygam.
# inspect gams
m2_preds <- tidygam::predict_gam(m2, exclude_terms = "s(prop_time,speaker)")
# show prediction
m2_preds
## # A tibble: 55 × 6
## LangSeg.ord prop_time PC1z se lower_ci upper_ci
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Japanese./l/ 1 -0.169 0.375 -0.903 0.566
## 2 Japanese./ɹ/ 1 -0.0436 0.376 -0.780 0.692
## 3 Japanese./ɾ/ 1 0.212 0.387 -0.546 0.970
## 4 English./l/ 1 -0.683 0.126 -0.931 -0.436
## 5 English./ɹ/ 1 -0.836 0.286 -1.40 -0.277
## 6 Japanese./l/ 2 -0.303 0.318 -0.926 0.320
## 7 Japanese./ɹ/ 2 -0.157 0.322 -0.788 0.475
## 8 Japanese./ɾ/ 2 0.156 0.333 -0.497 0.809
## 9 English./l/ 2 -0.827 0.104 -1.03 -0.624
## 10 English./ɹ/ 2 -0.987 0.247 -1.47 -0.504
## # ℹ 45 more rows
# separate two main effects
m2_preds_2 <- tidygam::predict_gam(
m2,
length_out = 25,
exclude_terms = "s(prop_time,speaker)",
separate = list(LangSeg.ord = c("L1", "segment"))
)
# plot the gamm predictions
m2_preds_2 |>
plot(series = "prop_time", comparison = "segment") +
scale_colour_manual(values = cbPalette) +
scale_fill_manual(values = cbPalette) +
facet_grid(cols = vars(L1))
Session Info
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.3.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidygam_0.2.0 itsadug_2.4.1 plotfunctions_1.4 mgcv_1.9-0
## [5] nlme_3.1-163 emmeans_1.8.9 lmerTest_3.1-3 lme4_1.1-35.1
## [9] Matrix_1.6-1.1 fdapace_0.5.9 ggpubr_0.6.0 lubridate_1.9.4
## [13] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [17] readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [21] tidyverse_2.0.0 rmdformats_1.0.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 fastmap_1.1.1
## [4] pracma_2.4.4 janitor_2.2.0.9000 digest_0.6.36
## [7] rpart_4.1.21 timechange_0.3.0 estimability_1.4.1
## [10] lifecycle_1.0.4 cluster_2.1.4 magrittr_2.0.3
## [13] compiler_4.3.2 rlang_1.1.5 Hmisc_5.1-1
## [16] sass_0.4.9 tools_4.3.2 utf8_1.2.4
## [19] yaml_2.3.8 data.table_1.14.8 knitr_1.45
## [22] ggsignif_0.6.4 labeling_0.4.3 htmlwidgets_1.6.4
## [25] RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [28] foreign_0.8-85 numDeriv_2016.8-1.1 nnet_7.3-19
## [31] grid_4.3.2 xtable_1.8-4 colorspace_2.1-1
## [34] scales_1.3.0 MASS_7.3-60 insight_0.19.6
## [37] cli_3.6.4 mvtnorm_1.2-5 rmarkdown_2.26
## [40] generics_0.1.3 rstudioapi_0.15.0 tzdb_0.4.0
## [43] minqa_1.2.6 cachem_1.0.8 splines_4.3.2
## [46] parallel_4.3.2 base64enc_0.1-3 vctrs_0.6.5
## [49] boot_1.3-28.1 jsonlite_1.8.8 carData_3.0-5
## [52] bookdown_0.42 car_3.1-2 hms_1.1.3
## [55] pbkrtest_0.5.2 rstatix_0.7.2 Formula_1.2-5
## [58] htmlTable_2.4.2 jquerylib_0.1.4 glue_1.8.0
## [61] nloptr_2.0.3 cowplot_1.1.3 stringi_1.8.4
## [64] gtable_0.3.6 munsell_0.5.1 pillar_1.10.1
## [67] htmltools_0.5.8.1 R6_2.6.1 evaluate_0.23
## [70] lattice_0.21-9 highr_0.10 backports_1.5.0
## [73] snakecase_0.11.1 broom_1.0.5 bslib_0.7.0
## [76] Rcpp_1.0.14 coda_0.19-4.1 gridExtra_2.3
## [79] checkmate_2.3.2 xfun_0.50 pkgconfig_2.0.3